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Abstract 

A Support Vector Machine (SVM) algorithm for multivariate density estimation is de- 
veloped based on regularization principles and bounds on the convergence of empirical 
distribution functions. The algorithm is compared to Gaussian Mixture Models (GMMs). 
Our algorithm outperforms GMMs for data drawn from mixtures of gaussians in IR 2 and 
]R 6 . Our algorithm is also automated with respect to parameters. 
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1 Introduction 

Estimating probability densities from a set of observed data points is a basic problem in statis- 
tics. In this paper we formulate a novel algorithm for multivariate density estimation that is 
mathematically well-founded, shows promise in preliminary experiments, and is automated with 
respect to parameter setting. 

Traditional approaches to estimating densities have been parametric. One assumes that the data 
are drawn from a parametric family of distributions and one then estimates the parameters of 
the family based upon the maximum likelihood principle. Gaussian Mixture Models (GMMs) 
are a particular case of the parametric approach. It is well known, even by peeople using this 
approach, that it is mathematically problematic [1] [11]. 

The problem of density estimation is an inverse operator problem which is ill-posed and stochas- 
tic. These problems are amenable to regularization approaches. The classical approach of the 
Parzen's window technique [6] can be derived from Tikhonov regularization [9], unfortunately 
this technique has not yielded good results for high-dimensional problems. We apply the residual 
method [7] of regularization to the density problem where the residual is set based on bounds on 
the convergence of empirical distribution functions. Setting the error functional in the residual 
method to a norm in a Reproducing Kernel Hilbert Space (RKHS) results in a Support Vector 
Machine (SVM) formulation. Previously, an SVM approach to density estimation was attempted 
[13], however it was only applied to univariate density estimation and performed very badly for 
the multivariate problem. Our algorithm works for both multivariate and univariate problems. 
We also present experimental results that show better performance than a GMM approach for 2 
and 6 dimensional data. 

2 A Support Vector Machine (SVM) approach 
2.1 Density estimation 

A probability density, p(t), is defined as the solution of the following equation 

p(t)dt = F(x), (1) 



where F(x) is the distribution function F(x) = IP{£ < x}, £ is a random variable. Estimating 
a probability density from data means solving this integral equation on a given set of densities 
p(t, a), a G A, when the distribution function F(x) is unknown and given a random independent 
sample {xi} i=1 obtained from this distribution. 
The empirical distribution function 

F t (x) = j'£e(x-x i ), (2) 

8 = 1 

is a good approximation of the actual distribution function F(x). The rate of convergence of 
this approximation is known asymptotically. For the univariate case, the random variable k 

k = V^sup \F(x) - F e (x)\ = VI\\F(x) - i^U 

X 

is independent of the distribution function. 

Hence one can consider the problem of density estimation as a problem of solving equation (1) 

using Ff(x) instead of F(x). Ff(x) converges to F(x) at a fast rate, 0(1 /y/ £). 



2.2 The density estimation problem is stochastic and ill-posed 

It is known that the problem of solving linear integral equation 

Af = F 

is ill-posed when the sequence Ff{x) of approximations are used on the right hand side instead 

oiF(x). 

Two forms of regularization can be used to solve ill-posed problems using approximations | \F(x) — 

Fi(x)\\ = Si. The idea of these methods is to introduce the so called regularizing functional 0(/) 

(semi-continuous, positive functional for which 0(/) < c is a compactum for all positive c) and 

define the solution / which is a trade-off between the value of the functional 0(/) and accuracy 

||Af-F,||. 

The two forms of this trade-off or regularization turn out asymptotically equivalent [12], they 

are the methods of Tikhonov [9] and Phillips [7] respectively, 



mm 

/ 



(||A/-F,|| 2 + 7 ,0(/)), 7 , >0, 



minO(/) s.t. \\Af - F f \\ < et, et > 0, 

where for certain conditions on constants 6t } 7^, and et the sequence of approximations converges 
to the desired solution. 

Both principles can be generalized for the stochastic case [11], in particular for the Tikhonov 
method it was shown that if function Ff(x) converges in probability to F(x) and 7^ — > then for 
any positive v and p there exists a positive number n(z/, p) such that for £ > n(z/, p) the following 
inequality holds 

P(p El (/, ft) > v) < P( PE2 (F, Ft) > V™J) (3) 

where pE 1 (f } ft) } PE 1 (f,fe) are metrics in the space of functions / and F. 

Since the empirical distribution function Ff{x) converges in probability to F(x) from equation 

(3) one concludes that estimating the density by solving integral equation (1) using Ff{x) is 

always consistent. 

The main problem in density estimation using a finite number of observations is specifying the 

regularization parameters. 

2.3 Regularization method for density estimation 

We solve the density estimation problem using the Philips' regularization form. We minimize 
the regularization functional 0(/) subject to constraint 

\\Af - F e \\ < vt, 

where at is the known discrepancy between F(x) and Ff{x). 

Usually it is not easy to evaluate at. However for the problem of density estimation one can 

obtain an exact estimate of at. 

As we discussed in section 2.1 for the one dimensional case the random variable k = yi,\\F(x) — 

-^(^)||oo has an universal distribution. Therefore one can choose at = -j= where c is a constant, 



for example the median of this distribution 0.6. For the multivariate case with probability 1 — rj 
the inequality 

ot < c( v ) (r k ^) 

holds true where k(d) is dehned by the dimensionality d of the space [4] [8]. These types of 
bounds are used to set the regularization parameter. 

2.4 SVM for density estimation 

To use the support vector method for the density estimation problem we look for a solution / in 
the set of functions that belong to some Reproducing Kernel Hilbert Space (RKHS) where we 
define the regularization functional 0(/) as a norm in the RKHS 

«(/) = II/IIh- (4) 

We minimize the functional (4) subject to constraints in a set of functions / £ H . It is known 
[3] that the solution of the optimization problem for / £ H has a form 

i 
f( x ) = ^2aiK(xi,x) 

8 = 1 

where the coefficients minimize the functional 

i 



«(«) = £«? (5) 

8 = 1 

subject to constraints 

||^-i^||oo<<T, 

where F(x) = ^p{t)dt. Minimizing (5) can be thought of as finding the smoothest function 
F(x) which is within an cr-tube of F((x). 

A diagnostic is needed to check whether an approximation F(x) exists that stays within the 
cr-tube. This is done by adding slack variables, £,- > 0. The addition of the slack variables to 
functional (5) and properties of the RKHS give us the following functional to minimize (which 
is identical to the functional minimized for SVMs for inverse operator problems [10]) 

i 
min|H| 2 + CV& (6) 

"^ 8 = 1 

subject to 

\F( Xi ) - F^ < a + &, Vxi, (7) 

where £,- are slack variables, C is set to infinity (in implementations C is set to a large number), 
and since we want to estimate a density the following constraints are added: 

i 
]>> 8 = 1 ,a,->0. (8) 

8 = 1 



Minimizing functional (6) with respect to the constraints in equations (7) and (8) is a quadratic 
programming problem with linear constraints and can be solved in either a primal or dual for- 
mulation [2]. The solution has the form 

i 
p(t, a) = Y^ OLiK{t, Xi, A), 

8 = 1 

where K(t } X{ } A) are kernel functions and A is another regularization parameter and most of the 
cti's are typically zero. 

The extension to the the multivariate case, x 8 £ IR , only involves constructing a multivariate 
empirical distribution function 

#M = £ n e(tf - x& (9) 

i = l j = l 

where xj is the j th component of the point x 8 -. Again the solution will have the form 

N 

p(t, a) = J2 <XiK(t, x,-, A), (10) 

8 = 1 

where N are the points x 8 - for which a 8 - is nonzero, these are the Support Vectors, in general 
N <C £. The parameter A, an example of which is the variance of a gaussian, is set by a line 
search: we select the largest A for which there exists an estimate within a cr-tube of the empirical 
distribution function. 

Note that this algorithm has only one free parameter a which is set based on bounds on the 
convergence of -F^(x) to F(x). 

2.5 Comparison to Gaussian Mixture Models (GMMs) 

When the kernel functions are gaussians the SVM solution is a linear combination of gaussians 
just like the GMM case. The difference between the two cases is how the inverse operator 
problem is solved. In the GMM case one uses the expectation maximization (EM) algorithm to 
estimate the parameters that maximize the log-likelihood. However, this maximum may not exist, 
from Basu and Micchelli " Generally a maximum likelihood does not exist." [1]. Regularization 
heuristics are imposed to deal with this problem. The two basic heuristics are to set a lower 
bound on the variance of the gaussians and also to specify an upper bound on the number of 
gaussians in the model. The problem is that there is no rigorous way to set these bounds which 
serve as the regularization parameters. 

Both the SVM and GMM methods require regularization to effectively solve the density estima- 
tion problem. The problem for the GMM approach is that it is difficult to set the regularization 
parameters in a mathematically rigorous way. In the SVM method the only free regularization 
parameter oi is set based upon the convergence properties of empirical distribution functions. 

3 Simulations 

In this section we apply our algorithm to a 2-dimensional density estimation problem and a 
12-dimensional density estimation problem. We compare the results to those achieved by a 
GMM. 



3.1 2-dimensional case 

Training data for 100 trials was generated by randomly sampling 200 data points for each trial 
from the following distribution 

D ( x y) = ^ e -((^-l) 2 /2 + (l/-l) 2 /8) , J_ -((^+l) 2 /8+(l/+l) 2 /2) 

n ' y > 8vr 8tt 

For the test data 1000 points were sampled from the above distribution. We applied our algorithm 
using a gaussian kernel 

K({t x ,t y },{ Xi , yi },\) = J- e -((*— 2 + (*»-«) 2 )/2A» 

and a = , 6 . We compared our algorithm to that of a GMM. The GMM algorithm consisited of a 
vector quantization and k-means clustering step followed by the expectation maximization (EM) 
algorithm to compute the maximum likelihood. The GMM required as parameters the maximum 
number of gaussians allowed and the minimum variance allowed. We set the maximum number 
of gaussians to 4 and 2 and the minimum variance allowed to .01. For each trial we computed 
the C 1 error or average absolute error between the estimate and true distribution. Figure (la) 
shows that our algorithm does slightly better than the GMM with two gaussians and much better 
than the GMM with four gaussians. Figure (2) plots the true density and the three estimated 
densities for the hrst trial. 

3.2 6-dimensional case 

Training data for 50 trials was generated by randomly sampling 600 data points for each trial 
from the following distribution 

nfv) - ( -^(x-^i) T sr 1 (x-^i) I + -^(x- At2 ) T S7 1 (x- At2 ) /i-|\ 

PK) ~ 24 7 r 3 Uet|E 1 | det|E 2 | l ' 



i l r -J,(x-^) T sr 1 (x-^3)^ 
det|E 3 | J 



where x £ IR 6 , the covariance matrices are diagonal with the following elements 
Si = {1.0,2.0,1.0,2.0,1.0,2.0}, 

5 2 = {2.0, 1.0, 2.0, 1.0, 2.0, 1.0}, 

5 3 = {2.0, 1.0, 2.0, 1.0, 2.0, 1.0}, 

//i = {1.0,1.0,1.0,1.0,1.0,1.0}, 

// 2 = {-1.0, -1.0, -1.0, -1.0, -1.0, -1.0}, 

/i 3 = {0.0,0.0,0.0,0.0,0.0,0.0}. 

For the test data 6000 points were sampled from the above distribution. We applied our algorithm 

using a gaussian kernel 

Kit x- A) - l e -l|t-x,|| 2 /2A2 

and a = Q 5 1/3 . We again compared our algorithm to that of a GMM. We set the maximum 
number of gaussians to 8 and 4 and the minimum variance allowed to .01. For each trial we 
computed the C 1 error or average absolute error between the estimate and true distribution. 
Figure (lb) shows that our algorithm does slightly better than the GMM with four gaussians 
and much better than the GMM with eight gaussians. 
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(a) (b) 

Figure 1: A boxplot of (a) the I 1 error over 100 trials for SVM, GMM with 2 gaussians, and 
GMM with 4 gaussians in the two dimensional case and (b) the C 1 error over 50 trials for SVM, 
GMM with 4 gaussians, and GMM with 8 gaussians in the six dimensional case. 
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(c) (d) 

Figure 2: For the hrst trial in the two dimensional case (a) the true density (b) the SVM estimate 
(c) the GMM estimate with 2 or fewer gaussians (d) the GMM estimate with 4 or fewer gaussians. 



4 Future work 

In order to apply this algorithm to high dimensional problems with many data points in the 
training set we will implement the dual formulation using the decomposition algorithm of Osuna 
et. al. [5]. We also need a better understanding of which bounds to use to obtain the a parameter. 
We will look at speech data as well as simulated data to examine general convergence behavior 
for practical problems so as to determine this parameter. 
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